The tidyverse library, ver. 1.3.1, is used for data processing.
This library includes all the required core packages:
- ggplot2 3.3.5, for data visualization.
- dplyr 1.0.8, for data manipulation.
- tidyr 1.2.0, for data tidying.
- readr 2.1.2, for data import.
- purrr 0.3.4, for functional programming.
as well as tibble ver. 3.1.6, stringr ver. 1.4.0, and forcats ver. 0.5.1.
The readxl package is also used to read .xls files.
For data processing and analysis, car, caTools, caret, lmtest, multcomp, vegan, vegan3d, plot3D, scatterplot3d, gridExtra, and limma were used.
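A setup chunk attaching these packages might look like the following (a sketch; only the packages actually used in a given section need to be loaded):

```r
# Core data-handling packages (the tidyverse attaches ggplot2, dplyr,
# tidyr, readr, purrr, tibble, stringr and forcats in one call).
library(tidyverse)
library(readxl)    # reading .xls files

# Packages for the statistical analysis below.
library(car)       # Anova()
library(lmtest)    # bptest()
library(multcomp)  # glht() for Tukey contrasts
library(limma)     # differential expression
```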
We load the data from all the .xls files into one variable.
The .xls files with the raw data should be placed in the
./data/ folder.
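A minimal sketch of this loading step (the variable name `mouse` is an assumption; the report does not show its actual object name for the combined data):

```r
# Collect every .xls file from ./data/ and row-bind the contents into
# a single data frame; `mouse` is a hypothetical name for the result.
files  <- list.files("./data", pattern = "\\.xls$", full.names = TRUE)
sheets <- lapply(files, function(f) readxl::read_excel(f))
mouse  <- do.call(rbind, sheets)  # NULL when no files are present
```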
There are 1080 rows and 82 variables in the data.
There are 72 mice in the experiment. Each mouse belongs to one of 8 classes:
Classes:
- c-CS-s: control mice, stimulated to learn, injected with saline
- c-CS-m: control mice, stimulated to learn, injected with memantine
- c-SC-s: control mice, not stimulated to learn, injected with saline
- c-SC-m: control mice, not stimulated to learn, injected with memantine
- t-CS-s: trisomic (Ts65Dn) mice, stimulated to learn, injected with saline
- t-CS-m: trisomic (Ts65Dn) mice, stimulated to learn, injected with memantine
- t-SC-s: trisomic (Ts65Dn) mice, not stimulated to learn, injected with saline
- t-SC-m: trisomic (Ts65Dn) mice, not stimulated to learn, injected with memantine
The dataset contains mice of 2 genotypes and 2 behaviors under 2 types of treatment:
If we drop the rows with NA in any column:
After deleting the rows with missing values, the data are not balanced across groups: in two cases the control group is twice as large as the corresponding group of trisomic mice. When the rows with missing values are counted, the data are roughly balanced.
The number of complete observations is 528 (there are 1080 observations in total).
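The complete-case count comes from `complete.cases()`; a toy illustration (the real input is the combined data frame loaded above):

```r
# complete.cases() flags rows that contain no NA in any column.
toy <- data.frame(class  = c("c-CS-m", "c-CS-s", "t-CS-m"),
                  BDNF_N = c(0.33, NA, 0.31))
sum(complete.cases(toy))  # 2 of the 3 rows are complete
# On the full data, sum(complete.cases(...)) gives 528 of 1080.
```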
Violin and box plots for each protein (after standardization) are in the
./data/plots/ folder.
Outliers (by the interquartile range rule) and NA values are removed within each class.
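The IQR rule can be sketched as follows; `iqr_filter` is a hypothetical helper, which in the report would be applied per class and per protein column:

```r
# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] become NA
# and can then be dropped together with the other missing values.
iqr_filter <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * diff(q)
  replace(x, x < q[1] - fence | x > q[2] + fence, NA)
}
iqr_filter(c(1:10, 100))  # the extreme value 100 is replaced by NA
```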
We have a sufficient number of observations to treat the distribution of the BDNF_N variable as approximately normal, as the graph shows:
We fit a linear regression model to the data and run a Shapiro-Wilk test to check the residuals for normality:
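The model fit and normality check can be reproduced as below; the data frame here is a small simulated stand-in for the report's `mouse_wo_outl` (8 classes and about 1069 rows in the real data):

```r
set.seed(1)
# Simulated stand-in for the outlier-filtered data `mouse_wo_outl`.
mouse_wo_outl <- data.frame(
  class  = rep(c("c-CS-m", "c-CS-s", "t-CS-m", "t-CS-s"), each = 25),
  BDNF_N = rnorm(100, mean = 0.33, sd = 0.045)
)
mod_BDNF <- lm(BDNF_N ~ class, data = mouse_wo_outl)
summary(mod_BDNF)
shapiro.test(residuals(mod_BDNF))  # report values: W = 0.997, p = 0.044
```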
##
## Call:
## lm(formula = BDNF_N ~ class, data = mouse_wo_outl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13623 -0.02822 -0.00125 0.02866 0.13426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.33834 0.00370 91.43 < 2e-16 ***
## classc-CS-s 0.00282 0.00538 0.52 0.6001
## classc-SC-m -0.04621 0.00523 -8.83 < 2e-16 ***
## classc-SC-s -0.02418 0.00538 -4.50 7.7e-06 ***
## classt-CS-m -0.02676 0.00538 -4.98 7.6e-07 ***
## classt-CS-s -0.03712 0.00580 -6.39 2.4e-10 ***
## classt-SC-m -0.01728 0.00537 -3.22 0.0013 **
## classt-SC-s -0.01275 0.00540 -2.36 0.0184 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0452 on 1061 degrees of freedom
## Multiple R-squared: 0.114, Adjusted R-squared: 0.108
## F-statistic: 19.5 on 7 and 1061 DF, p-value: <2e-16
The Shapiro-Wilk normality test on the residuals gives p-value = 0.044 (n = 1069, W = 0.997). Although this formally rejects normality at the 0.05 level, with such a large sample and W so close to 1 the deviation is negligible, so we treat the residuals as approximately normal.
We therefore consider it acceptable to carry out a one-way analysis of variance (one-way ANOVA).
# Type II ANOVA from the car package
anova_test_BDNF <- Anova(mod_BDNF)
The level of BDNF_N production depends on the class in the experiment (F = 19.465375, p-value = 1.29894 × 10^-24, df_1 = 7, df_2 = 1061).
To identify pairwise differences, Tukey's post-hoc test was performed (df = 1061).
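The Tukey comparisons come from `multcomp::glht`; here is a sketch on a built-in dataset (the report runs the same call on `mod_BDNF` with the 8-level `class` factor, giving 28 pairwise contrasts):

```r
library(multcomp)
# Tukey all-pairwise contrasts on a factor, with single-step adjusted
# p-values; warpbreaks stands in for the mouse data here.
m   <- lm(breaks ~ tension, data = warpbreaks)
tuk <- glht(m, linfct = mcp(tension = "Tukey"))
summary(tuk)
```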
There are statistically significant differences in BDNF_N production levels between classes in the pairwise comparisons:
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = BDNF_N ~ class, data = mouse_wo_outl)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## c-CS-s - c-CS-m == 0 0.00282 0.00538 0.52 0.9995
## c-SC-m - c-CS-m == 0 -0.04621 0.00523 -8.83 <0.001
## c-SC-s - c-CS-m == 0 -0.02418 0.00538 -4.50 <0.001
## t-CS-m - c-CS-m == 0 -0.02676 0.00538 -4.98 <0.001
## t-CS-s - c-CS-m == 0 -0.03712 0.00580 -6.39 <0.001
## t-SC-m - c-CS-m == 0 -0.01728 0.00537 -3.22 0.0284
## t-SC-s - c-CS-m == 0 -0.01275 0.00540 -2.36 0.2605
## c-SC-m - c-CS-s == 0 -0.04903 0.00538 -9.12 <0.001
## c-SC-s - c-CS-s == 0 -0.02700 0.00552 -4.89 <0.001
## t-CS-m - c-CS-s == 0 -0.02958 0.00552 -5.36 <0.001
## t-CS-s - c-CS-s == 0 -0.03994 0.00594 -6.73 <0.001
## t-SC-m - c-CS-s == 0 -0.02010 0.00551 -3.65 0.0067
## t-SC-s - c-CS-s == 0 -0.01557 0.00554 -2.81 0.0933
## c-SC-s - c-SC-m == 0 0.02203 0.00538 4.10 0.0012
## t-CS-m - c-SC-m == 0 0.01946 0.00538 3.62 0.0075
## t-CS-s - c-SC-m == 0 0.00910 0.00580 1.57 0.7694
## t-SC-m - c-SC-m == 0 0.02894 0.00537 5.39 <0.001
## t-SC-s - c-SC-m == 0 0.03346 0.00540 6.20 <0.001
## t-CS-m - c-SC-s == 0 -0.00258 0.00552 -0.47 0.9998
## t-CS-s - c-SC-s == 0 -0.01294 0.00594 -2.18 0.3647
## t-SC-m - c-SC-s == 0 0.00690 0.00551 1.25 0.9152
## t-SC-s - c-SC-s == 0 0.01143 0.00554 2.06 0.4393
## t-CS-s - t-CS-m == 0 -0.01036 0.00594 -1.75 0.6561
## t-SC-m - t-CS-m == 0 0.00948 0.00551 1.72 0.6728
## t-SC-s - t-CS-m == 0 0.01400 0.00554 2.53 0.1846
## t-SC-m - t-CS-s == 0 0.01984 0.00593 3.35 0.0189
## t-SC-s - t-CS-s == 0 0.02437 0.00595 4.09 0.0012
## t-SC-s - t-SC-m == 0 0.00452 0.00553 0.82 0.9921
##
## c-CS-s - c-CS-m == 0
## c-SC-m - c-CS-m == 0 ***
## c-SC-s - c-CS-m == 0 ***
## t-CS-m - c-CS-m == 0 ***
## t-CS-s - c-CS-m == 0 ***
## t-SC-m - c-CS-m == 0 *
## t-SC-s - c-CS-m == 0
## c-SC-m - c-CS-s == 0 ***
## c-SC-s - c-CS-s == 0 ***
## t-CS-m - c-CS-s == 0 ***
## t-CS-s - c-CS-s == 0 ***
## t-SC-m - c-CS-s == 0 **
## t-SC-s - c-CS-s == 0 .
## c-SC-s - c-SC-m == 0 **
## t-CS-m - c-SC-m == 0 **
## t-CS-s - c-SC-m == 0
## t-SC-m - c-SC-m == 0 ***
## t-SC-s - c-SC-m == 0 ***
## t-CS-m - c-SC-s == 0
## t-CS-s - c-SC-s == 0
## t-SC-m - c-SC-s == 0
## t-SC-s - c-SC-s == 0
## t-CS-s - t-CS-m == 0
## t-SC-m - t-CS-m == 0
## t-SC-s - t-CS-m == 0
## t-SC-m - t-CS-s == 0 *
## t-SC-s - t-CS-s == 0 **
## t-SC-s - t-SC-m == 0
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
Groups between which there was a statistically significant difference in BDNF_N protein production are marked with one *, two **, or three ***:
Count the NA’s in each column and return the names of the columns that contain NA’s (with the number of NA’s):
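One way to get these counts (a sketch on a toy frame; the report applies it to the full protein table):

```r
# Count NAs per column and keep only the columns that contain any.
toy <- data.frame(a = c(1, NA, 3), b = 1:3, c = c(NA, NA, 3))
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]  # a: 1, c: 2
```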
## ELK_N MEK_N Bcatenin_N BAD_N BCL2_N
## 15 4 15 213 285
## pCFOS_N H3AcK18_N EGR1_N H3MeK4_N
## 75 180 210 270
Remove the columns with more than 100 NA’s, and after that the remaining rows with NA.
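This two-step cleaning can be sketched with a hypothetical helper `drop_sparse` (the report uses a threshold of 100 NA's):

```r
# Drop columns with more than `limit` NAs, then drop incomplete rows.
drop_sparse <- function(df, limit) {
  df <- df[, colSums(is.na(df)) <= limit, drop = FALSE]
  df[complete.cases(df), , drop = FALSE]
}
toy <- data.frame(a = c(1, NA, 3), b = 1:3)
drop_sparse(toy, limit = 0)  # column a is dropped, all 3 rows kept
# The report's limit of 100 leaves 972 rows and 77 variables.
```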
There are 972 rows and 77 variables in the new data.
##
## Call:
## lm(formula = ERBB4_N ~ ., data = norm.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6162 -0.2644 -0.0043 0.2665 1.6251
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.316564 0.191278 1.65 0.09840 .
## DYRK1A_N -0.739740 0.445301 -1.66 0.09715 .
## ITSN1_N 0.431635 0.128979 3.35 0.00086 ***
## BDNF_N -0.006208 0.056456 -0.11 0.91247
## NR1_N -0.283760 0.109536 -2.59 0.00979 **
## NR2A_N -0.082195 0.074829 -1.10 0.27242
## pAKT_N 0.196285 0.067807 2.89 0.00392 **
## pBRAF_N -0.073160 0.049169 -1.49 0.13725
## pCAMKII_N -0.035047 0.049005 -0.72 0.47475
## pCREB_N -0.126384 0.055048 -2.30 0.02200 *
## pELK_N -0.035600 0.037029 -0.96 0.33669
## pERK_N -0.034678 0.069986 -0.50 0.62041
## pJNK_N -0.137350 0.069811 -1.97 0.04955 *
## PKCA_N 0.282964 0.071014 3.98 7.5e-05 ***
## pMEK_N 0.022133 0.064542 0.34 0.73176
## pNR1_N -0.141583 0.086192 -1.64 0.10093
## pNR2A_N 0.091970 0.066353 1.39 0.16619
## pNR2B_N 0.133880 0.086790 1.54 0.12341
## pPKCAB_N 0.011033 0.076577 0.14 0.88548
## pRSK_N 0.048000 0.049893 0.96 0.33637
## AKT_N 0.086348 0.061566 1.40 0.16123
## BRAF_N -0.259251 0.126212 -2.05 0.04036 *
## CAMKII_N 0.054620 0.063491 0.86 0.38995
## CREB_N -0.069386 0.046194 -1.50 0.13356
## ELK_N 0.036247 0.079986 0.45 0.65058
## ERK_N 0.248546 0.092547 2.69 0.00742 **
## GSK3B_N 0.000181 0.086711 0.00 0.99833
## JNK_N 0.053428 0.061683 0.87 0.38671
## MEK_N -0.039407 0.053434 -0.74 0.46109
## TRKA_N -0.058324 0.094674 -0.62 0.53808
## RSK_N 0.076056 0.064033 1.19 0.23536
## APP_N -0.046249 0.044811 -1.03 0.30240
## Bcatenin_N 0.105490 0.111162 0.95 0.34298
## SOD1_N -0.053581 0.045364 -1.18 0.23798
## MTOR_N 0.178843 0.061983 2.89 0.00404 **
## P38_N -0.021191 0.067464 -0.31 0.75354
## pMTOR_N -0.101488 0.059056 -1.72 0.08617 .
## DSCR1_N -0.037512 0.057192 -0.66 0.51212
## AMPKA_N -0.049791 0.081247 -0.61 0.54020
## NR2B_N 0.002035 0.048894 0.04 0.96681
## pNUMB_N -0.037937 0.064028 -0.59 0.55371
## RAPTOR_N 0.066184 0.072824 0.91 0.36378
## TIAM1_N 0.006683 0.068417 0.10 0.92221
## pP70S6_N 0.009730 0.049914 0.19 0.84550
## NUMB_N -0.109550 0.059067 -1.85 0.06409 .
## P70S6_N -0.057986 0.050614 -1.15 0.25235
## pGSK3B_N 0.041136 0.047073 0.87 0.38250
## pPKCG_N -0.335286 0.063631 -5.27 1.9e-07 ***
## CDK5_N 0.013850 0.026755 0.52 0.60487
## S6_N 0.065872 0.051731 1.27 0.20335
## ADARB1_N -0.040814 0.039372 -1.04 0.30029
## AcetylH3K9_N 0.135841 0.058159 2.34 0.01981 *
## RRP1_N -0.037075 0.023156 -1.60 0.10983
## BAX_N 0.014668 0.037831 0.39 0.69834
## ARC_N 0.167255 0.049050 3.41 0.00069 ***
## nNOS_N -0.031950 0.037210 -0.86 0.39084
## Tau_N 0.185728 0.062368 2.98 0.00301 **
## GFAP_N -0.041150 0.035860 -1.15 0.25159
## GluR3_N -0.062585 0.035806 -1.75 0.08095 .
## GluR4_N -0.035215 0.023093 -1.52 0.12775
## IL1B_N 0.075211 0.050746 1.48 0.13879
## P3525_N 0.193371 0.038600 5.01 7.0e-07 ***
## pCASP9_N 0.170040 0.045854 3.71 0.00023 ***
## PSD95_N 0.391634 0.047308 8.28 7.0e-16 ***
## SNCA_N 0.009827 0.047337 0.21 0.83561
## Ubiquitin_N -0.042071 0.049402 -0.85 0.39475
## pGSK3B_Tyr216_N 0.180801 0.052661 3.43 0.00063 ***
## SHH_N -0.000416 0.031477 -0.01 0.98946
## pS6_N NA NA NA NA
## pCFOS_N -0.038880 0.036897 -1.05 0.29239
## SYP_N 0.079747 0.034568 2.31 0.02137 *
## CaNA_N -0.115093 0.068908 -1.67 0.09535 .
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.446 on 658 degrees of freedom
## Multiple R-squared: 0.82, Adjusted R-squared: 0.801
## F-statistic: 42.8 on 70 and 658 DF, p-value: <2e-16
Residual standard error: 0.424 on 658 degrees of freedom
Multiple R-squared: 0.819925, Adjusted R-squared: 0.800768.
F-statistic: 43 on 70 and 658 DF, p-value: <2e-16.
Residuals vs Fitted plot: the relationship looks linear, which is good.
Normal Q-Q plot: the tails are 'heavier' (have larger values) than we would expect under the standard modeling assumptions; this is acceptable.
Scale-Location plot: let's run the Breusch-Pagan test to check for heteroskedasticity more formally:
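The Breusch-Pagan test is `lmtest::bptest`; a sketch on a built-in dataset (the report applies it to the ERBB4_N model):

```r
library(lmtest)
# Breusch-Pagan test: H0 = homoskedastic residuals; a small p-value
# indicates heteroskedasticity.
m <- lm(dist ~ speed, data = cars)
bptest(m)
```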
Result: as expected, there is heteroskedasticity.
Residuals vs Leverage plot (observations whose standardized residuals are greater than 3 in absolute value are possible outliers): there are a few influential observations, but no outliers.
pred_y_lm <- predict(model, dplyr::select(norm.test, -ERBB4_N))
## Warning in predict.lm(model, dplyr::select(norm.test,
## -ERBB4_N)): prediction from a rank-deficient fit may be
## misleading
The warning points to multicollinearity in the data, which leads to a rank-deficient model matrix. We can apply PCA to tackle the multicollinearity and then fit the linear model on the principal components.
MAE = 0.309649. The model contains too many predictors, heteroskedastic residuals, and multicollinearity in the data, so it could be improved.
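The MAE is the mean absolute difference between observed and predicted values; `mae` here is a hypothetical helper:

```r
# Mean absolute error between observed and predicted values.
mae <- function(actual, predicted) mean(abs(actual - predicted))
mae(c(1, 2, 3), c(1.5, 2.0, 2.5))  # 1/3
# In the report: mae(norm.test$ERBB4_N, pred_y_lm) = 0.309649.
```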
Run a principal component analysis on the whole training set.
There are 70 principal components.
Let’s analyze the results using the eigenvalue plot:
You can see that about 7 principal components may already be enough.
The Cumulative Proportion from summary:
90% of the variability can be explained by 17 principal components. Let's use these 17 components to build a linear model.
Get the principal component scores and transform the train data.
Transform the test data.
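The PCA steps (fit on the training predictors, extract scores, project the test set) can be sketched on a built-in dataset; the report keeps 17 of its 70 components:

```r
# PCA on standardized numeric predictors; iris stands in for the
# protein table here.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)$importance["Cumulative Proportion", ]  # variance explained
train_scores <- pca$x[, 1:2]                  # component scores ("train")
test_scores  <- predict(pca, iris[1:5, 1:4])  # project new ("test") rows
```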
##
## Call:
## lm(formula = ERBB4_N ~ ., data = pca_train_17)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8789 -0.3003 -0.0137 0.3078 1.4854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01250 0.01832 0.68 0.49536
## PC1 -0.07795 0.00414 -18.82 < 2e-16 ***
## PC2 -0.06810 0.00543 -12.55 < 2e-16 ***
## PC3 0.20823 0.00672 31.00 < 2e-16 ***
## PC4 -0.13707 0.00786 -17.44 < 2e-16 ***
## PC5 0.05942 0.00992 5.99 3.3e-09 ***
## PC6 0.09799 0.01032 9.49 < 2e-16 ***
## PC7 -0.04772 0.01187 -4.02 6.4e-05 ***
## PC8 0.00478 0.01285 0.37 0.71021
## PC9 0.15376 0.01439 10.69 < 2e-16 ***
## PC10 0.12291 0.01626 7.56 1.3e-13 ***
## PC11 -0.07667 0.01751 -4.38 1.4e-05 ***
## PC12 0.21888 0.01904 11.49 < 2e-16 ***
## PC13 -0.07218 0.02072 -3.48 0.00053 ***
## PC14 -0.04633 0.02093 -2.21 0.02715 *
## PC15 0.00506 0.02224 0.23 0.82018
## PC16 -0.10543 0.02368 -4.45 9.8e-06 ***
## PC17 0.09713 0.02491 3.90 0.00011 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.491 on 711 degrees of freedom
## Multiple R-squared: 0.765, Adjusted R-squared: 0.759
## F-statistic: 136 on 17 and 711 DF, p-value: <2e-16
R-squared: 0.76497, Adjusted R-squared: 0.75935.
(without PCA it was multiple R-squared: 0.819925, adjusted R-squared: 0.800768.)
MAE = 1.021314.
According to the R-squared, the model became a bit worse, and the MAE has grown substantially.
Residuals vs Fitted plot: the relationship looks linear, which is good.
Normal Q-Q plot: the tails are 'heavier' (have larger values) than we would expect under the standard modeling assumptions; this is acceptable.
Scale-Location plot: let's run the Breusch-Pagan test to check for heteroskedasticity more formally:
Result: there is heteroskedasticity.
Residuals vs Leverage plot (observations whose standardized residuals are greater than 3 in absolute value are possible outliers): there are a few influential observations, but no outliers.
Expression levels for 72 proteins were assessed.
Let’s determine how many proteins changed their expression during
“stimulation to learning”. SC was the baseline.
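The limma workflow can be sketched as below on simulated data (the report contrasts stimulated vs. not-stimulated mice with SC as the baseline); limma is a Bioconductor package:

```r
library(limma)
set.seed(1)
# Toy expression matrix: 72 "proteins" x 20 samples.
expr <- matrix(rnorm(72 * 20), nrow = 72,
               dimnames = list(paste0("protein", 1:72), NULL))
behavior <- factor(rep(c("SC", "CS"), each = 10), levels = c("SC", "CS"))
design <- model.matrix(~ behavior)  # SC is the baseline level
fit <- eBayes(lmFit(expr, design))
dt  <- decideTests(fit)
summary(dt)  # Down / NotSig / Up counts per coefficient
```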
## Down NotSig Up
## 40 9 23
When mice are stimulated to learn, 40 proteins are down-regulated and 23 are up-regulated.
Up-regulated proteins:
## [1] "DYRK1A_N" "ITSN1_N" "BDNF_N"
## [4] "NR2A_N" "pELK_N" "pERK_N"
## [7] "PKCA_N" "pPKCAB_N" "BRAF_N"
## [10] "ELK_N" "ERK_N" "GSK3B_N"
## [13] "TRKA_N" "APP_N" "pNUMB_N"
## [16] "NUMB_N" "pGSK3B_N" "CDK5_N"
## [19] "S6_N" "ADARB1_N" "GFAP_N"
## [22] "pGSK3B_Tyr216_N" "CaNA_N"
Down-regulated proteins:
## [1] "pAKT_N" "pBRAF_N" "pCAMKII_N"
## [4] "pCREB_N" "pJNK_N" "pMEK_N"
## [7] "pNR1_N" "pNR2A_N" "pNR2B_N"
## [10] "AKT_N" "CAMKII_N" "CREB_N"
## [13] "MEK_N" "RSK_N" "Bcatenin_N"
## [16] "SOD1_N" "MTOR_N" "P38_N"
## [19] "pMTOR_N" "DSCR1_N" "AMPKA_N"
## [22] "NR2B_N" "RAPTOR_N" "TIAM1_N"
## [25] "pP70S6_N" "AcetylH3K9_N" "RRP1_N"
## [28] "ARC_N" "nNOS_N" "Tau_N"
## [31] "IL1B_N" "pCASP9_N" "PSD95_N"
## [34] "SNCA_N" "Ubiquitin_N" "SHH_N"
## [37] "pS6_N" "pCFOS_N" "SYP_N"
## [40] "ERBB4_N"
Let’s determine how many proteins changed their expression when mice
received memantine. Saline was the baseline.
## Down NotSig Up
## 21 28 23
When mice receive memantine, 21 proteins are down-regulated and 23 are up-regulated.
Up-regulated proteins:
## [1] "pAKT_N" "pBRAF_N" "pCAMKII_N" "pJNK_N"
## [5] "pMEK_N" "pNR2A_N" "CREB_N" "MTOR_N"
## [9] "P38_N" "pMTOR_N" "DSCR1_N" "NR2B_N"
## [13] "RAPTOR_N" "ARC_N" "nNOS_N" "GluR3_N"
## [17] "IL1B_N" "pCASP9_N" "PSD95_N" "SNCA_N"
## [21] "Ubiquitin_N" "pS6_N" "ERBB4_N"
Down-regulated proteins:
## [1] "ITSN1_N" "BDNF_N" "NR1_N" "pPKCAB_N"
## [5] "AKT_N" "ELK_N" "ERK_N" "Bcatenin_N"
## [9] "SOD1_N" "NUMB_N" "pGSK3B_N" "pPKCG_N"
## [13] "CDK5_N" "S6_N" "ADARB1_N" "RRP1_N"
## [17] "BAX_N" "Tau_N" "GFAP_N" "P3525_N"
## [21] "CaNA_N"
Let’s determine how many proteins changed their expression in Ts65Dn
mice. Control was the baseline.
## Down NotSig Up
## 18 34 20
In mice with the Ts65Dn genotype, 18 proteins are down-regulated and 20 are up-regulated.
Up-regulated proteins:
## [1] "DYRK1A_N" "ITSN1_N" "pCREB_N"
## [4] "pRSK_N" "TRKA_N" "APP_N"
## [7] "TIAM1_N" "pP70S6_N" "NUMB_N"
## [10] "pGSK3B_N" "pPKCG_N" "CDK5_N"
## [13] "S6_N" "AcetylH3K9_N" "RRP1_N"
## [16] "BAX_N" "Tau_N" "GFAP_N"
## [19] "P3525_N" "pGSK3B_Tyr216_N"
Down-regulated proteins:
## [1] "NR1_N" "NR2A_N" "pNR1_N" "pNR2A_N" "pNR2B_N"
## [6] "AKT_N" "MTOR_N" "P38_N" "pMTOR_N" "AMPKA_N"
## [11] "NR2B_N" "ARC_N" "GluR3_N" "IL1B_N" "SNCA_N"
## [16] "pS6_N" "pCFOS_N" "SYP_N"
Let’s determine how class affected protein expression levels.
c-SC-s was the baseline.
The numbers of up- and down-regulated proteins are given in the table.
There are 1080 rows and 82 variables in the raw data.
There are 72 mice in the experiment; each mouse belongs to one of 8 classes.
The dataset contains mice of 2 genotypes and 2 behaviors under 2 types of treatment.
The number of complete observations is 528 (there are 1080 observations in total).
Violin and box plots for each protein (after standardization) are
in the ./data/plots/ folder.
The level of BDNF_N production depends on the class in the experiment (F = 19.465375, p-value = 1.29894 × 10^-24, df_1 = 7, df_2 = 1061).
There are statistically significant differences in BDNF_N production levels between classes in pairwise comparisons (Tukey's post-hoc test).
Groups between which there was a statistically significant difference in BDNF_N protein production:
A linear model for ERBB4_N was built. There are 972 rows and 77
variables in the new data without NA.
Residual standard error: 0.424 on 658 degrees of freedom.
Multiple R-squared: 0.819925, Adjusted R-squared: 0.800768.
F-statistic: 43 on 70 and 658 DF, p-value: <2e-16.
MAE = 0.309649. The model contains too many predictors, heteroskedastic
residuals, and multicollinearity in the data, so it could be improved.
We applied PCA before the linear model. There are 70 principal
components; we used 17 of them (90% of the variability can be
explained by 17 components).
R-squared: 0.76497, Adjusted R-squared: 0.75935.
(without PCA it was multiple R-squared: 0.819925, adjusted R-squared:
0.800768.)
MAE = 1.021314.
According to the R-squared, the model became a bit worse, and the MAE
has grown substantially.
Search for differentially expressed proteins using limma: expression levels for 72 proteins were assessed.